This webpage provides the materials for a presentation given during a campus visit to Hamilton College on May 5, 2023.
Please feel free to contact me at thlim@arizona.edu if you have any questions.
Remember
You can download the do-file here:
Download data_science_for_hamiltonians.do

// Load the data set with the second row as the variable names
import delimited "NY-snowfall-202301.csv", varnames(2) clear
(38 vars, 469 obs)
tab jan1
Jan 1 | Freq. Percent Cum.
------------+-----------------------------------
0.0 | 254 54.16 54.16
M | 214 45.63 99.79
T | 1 0.21 100.00
------------+-----------------------------------
Total | 469 100.00
// Replace "M" (missing) and "T" (trace) with "0" in all jan* columns
foreach var of varlist jan* {
    replace `var' = subinstr(`var', "M", "0", .)
    replace `var' = subinstr(`var', "T", "0", .)
}
tab jan1
(214 real changes made)
(1 real change made)
  (output omitted: analogous messages for jan2-jan31)
Jan 1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 215 45.84 45.84
0.0 | 254 54.16 100.00
------------+-----------------------------------
Total | 469 100.00
// Check data type
describe jan1
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
jan1 str3 %9s Jan 1
// Convert the jan* columns to numeric
foreach var of varlist jan* {
    destring `var', replace
}
// Check data type
describe jan1
tab jan1
jan1 has all characters numeric; replaced as byte
jan2 has all characters numeric; replaced as double
  (output omitted: analogous messages for jan3-jan31)
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
jan1 byte %10.0g Jan 1
Jan 1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 469 100.00 100.00
------------+-----------------------------------
Total | 469 100.00
// Add a new column with the sum of each row
egen total = rowtotal(jan1 - jan31)
sum total, d
total
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 0 0
10% 0 0 Obs 469
25% 1.1 0 Sum of Wgt. 469
50% 7.5 Mean 7.889765
Largest Std. Dev. 6.822664
75% 11.8 29.7
90% 16.4 31.4 Variance 46.54874
95% 19.4 32.9 Skewness .8291596
99% 29.5 33.8 Kurtosis 3.733055
// Check if Elevation and Latitude are numeric
describe elevation latitude
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
elevation int %8.0g Elevation
latitude float %9.0g Latitude
sum elevation, d
Elevation
-------------------------------------------------------------
Percentiles Smallest
1% 5 -9999
5% 30 3
10% 100 4 Obs 469
25% 380 4 Sum of Wgt. 469
50% 642 Mean 743.5416
Largest Std. Dev. 723.4853
75% 1110 2029
90% 1568 2169 Variance 523431
95% 1698 2220 Skewness -6.733887
99% 2021 2495 Kurtosis 104.8178
tab stationname if elevation == -9999
Station Name | Freq. Percent Cum.
-------------------------------+-----------------------------------
BATAVIA 1.4 E | 1 100.00 100.00
-------------------------------+-----------------------------------
Total | 1 100.00
// Replace the -9999 missing-value code for BATAVIA 1.4 E with an approximate elevation
recode elevation -9999=900
(elevation: 1 changes made)
twoway scatter total elevation, ///
xtitle("Elevation") ytitle("Total") ///
title("Scatterplot of Total vs. Elevation")
quietly graph export scatter_total_elevation.png, replace
Total ~ Elevation
twoway (scatter total latitude) (lfit total latitude), ///
xtitle("Latitude") ytitle("Total") ///
title("Scatterplot of Total vs. Latitude")
quietly graph export lfit_total_latitude.png, replace
Total ~ Latitude
reg total elevation latitude
Source | SS df MS Number of obs = 469
-------------+---------------------------------- F(2, 466) = 197.88
Model | 10004.6003 2 5002.30013 Prob > F = 0.0000
Residual | 11780.211 466 25.2794228 R-squared = 0.4592
-------------+---------------------------------- Adj R-squared = 0.4569
Total | 21784.8113 468 46.548742 Root MSE = 5.0279
------------------------------------------------------------------------------
total | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
elevation | .0051263 .0004693 10.92 0.000 .0042041 .0060484
latitude | 3.313543 .2761122 12.00 0.000 2.770963 3.856122
_cons | -136.9432 11.62777 -11.78 0.000 -159.7926 -114.0939
------------------------------------------------------------------------------
margins, at(latitude=(40.54(1)44.84))
Predictive margins Number of obs = 469
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : latitude = 40.54
2._at : latitude = 41.54
3._at : latitude = 42.54
4._at : latitude = 43.54
5._at : latitude = 44.54
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 1.318498 .5947581 2.22 0.027 .1497579 2.487238
2 | 4.63204 .3571999 12.97 0.000 3.930118 5.333962
3 | 7.945583 .2322118 34.22 0.000 7.489271 8.401895
4 | 11.25913 .3643196 30.90 0.000 10.54321 11.97504
5 | 14.57267 .6033334 24.15 0.000 13.38708 15.75826
------------------------------------------------------------------------------
marginsplot
quietly graph export marginsplot_latitude.png, replace
Variables that uniquely identify margins: latitude
Marginal Effect of Latitude
margins, at(elevation=(0(100)2500))
Predictive margins Number of obs = 469
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : elevation = 0
2._at : elevation = 100
3._at : elevation = 200
4._at : elevation = 300
5._at : elevation = 400
6._at : elevation = 500
7._at : elevation = 600
8._at : elevation = 700
9._at : elevation = 800
10._at : elevation = 900
11._at : elevation = 1000
12._at : elevation = 1100
13._at : elevation = 1200
14._at : elevation = 1300
15._at : elevation = 1400
16._at : elevation = 1500
17._at : elevation = 1600
18._at : elevation = 1700
19._at : elevation = 1800
20._at : elevation = 1900
21._at : elevation = 2000
22._at : elevation = 2100
23._at : elevation = 2200
24._at : elevation = 2300
25._at : elevation = 2400
26._at : elevation = 2500
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | . (not estimable)
2 | 5.984411 .5999464 9.97 0.000 4.805462 7.163359
3 | 5.60585 .4222693 13.28 0.000 4.776053 6.435648
4 | 5.453481 .3367075 16.20 0.000 4.791821 6.115142
5 | 5.502861 .3223978 17.07 0.000 4.86932 6.136402
6 | 5.729548 .3373162 16.99 0.000 5.066691 6.392405
7 | 6.1091 .3529676 17.31 0.000 5.415487 6.802713
8 | 6.617076 .3605384 18.35 0.000 5.908586 7.325567
9 | 7.229034 .3614414 20.00 0.000 6.518769 7.939299
10 | 7.920532 .3610039 21.94 0.000 7.211127 8.629937
11 | 8.667128 .3648421 23.76 0.000 7.950181 9.384076
12 | 9.444381 .3762415 25.10 0.000 8.705032 10.18373
13 | 10.22785 .3949241 25.90 0.000 9.451787 11.00391
14 | 10.99309 .417921 26.30 0.000 10.17184 11.81434
15 | 11.71566 .4419586 26.51 0.000 10.84717 12.58415
16 | 12.37112 .4662718 26.53 0.000 11.45486 13.28739
17 | 12.93503 .49549 26.11 0.000 11.96135 13.90871
18 | 13.38294 .5421034 24.69 0.000 12.31766 14.44823
19 | 13.69042 .6260242 21.87 0.000 12.46023 14.92062
20 | 13.83302 .7685783 18.00 0.000 12.3227 15.34335
21 | 13.7863 .9852521 13.99 0.000 11.8502 15.72241
22 | 13.52582 1.284649 10.53 0.000 11.00137 16.05027
23 | 13.02714 1.671993 7.79 0.000 9.741524 16.31276
24 | 12.26581 2.151945 5.70 0.000 8.037047 16.49458
25 | 11.2174 2.729625 4.11 0.000 5.853439 16.58136
26 | 9.857454 3.410736 2.89 0.004 3.155052 16.55986
------------------------------------------------------------------------------
marginsplot
quietly graph export marginsplot_elevation.png, replace
Variables that uniquely identify margins: elevation
Marginal Effect of Elevation
// Reshape data to long format
reshape long jan, i(ghcnid) j(date) string
(note: j = 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30
> 31 4 5 6 7 8 9)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 469 -> 14539
Number of variables 39 -> 10
j variable (31 values) -> date
xij variables:
jan1 jan10 ... jan9 -> jan
-----------------------------------------------------------------------------
rename jan inches
// Convert date from string to numeric
destring date, replace
date has all characters numeric; replaced as byte
line inches date if stationname == "NEW HARTFORD 0.8 S"
quietly graph export line_08S.png, replace
Line Graph of New Hartford 0.8 S
scatter inches date if stationname == "NEW HARTFORD 1.0 WSW"
quietly graph export scatter_10WSW.png, replace
Scatter Plot of New Hartford 1.0 WSW
twoway (scatter inches date if stationname == "NEW HARTFORD 0.8 S") (line inches date if stationname == "NEW HARTFORD 0.8 S")
quietly graph export scatter_line_08S.png, replace
Scatter Plot and Line Graph of New Hartford 0.8 S
Remember
You can download the rscript here:
Download data_science_for_hamiltonians.R

####Load and Reshape Data####
# Attach libraries
library(tidyverse)
####get total snowfall by GHCN####
# Load the CSV file of snowfall
# Skip the first row so that the second row becomes the header of the data frame
snowfall <- read.csv("NY-snowfall-202301.csv", skip = 1, header = TRUE)
# Display column names
colnames(snowfall)
[1] "GHCN.ID" "Station.Name" "County" "State" "Elevation"
[6] "Latitude" "Longitude" "Jan.1" "Jan.2" "Jan.3"
[11] "Jan.4" "Jan.5" "Jan.6" "Jan.7" "Jan.8"
[16] "Jan.9" "Jan.10" "Jan.11" "Jan.12" "Jan.13"
[21] "Jan.14" "Jan.15" "Jan.16" "Jan.17" "Jan.18"
[26] "Jan.19" "Jan.20" "Jan.21" "Jan.22" "Jan.23"
[31] "Jan.24" "Jan.25" "Jan.26" "Jan.27" "Jan.28"
[36] "Jan.29" "Jan.30" "Jan.31"
# Display the first six rows
head(snowfall)
GHCN.ID Station.Name County State Elevation Latitude
1 US1NYER0211 AKRON 3.6 W ERIE New York 641 43.02
2 US1NYAB0041 ALBANY 0.3 ESE ALBANY New York 213 42.66
3 US1NYAB0023 ALBANY 0.7 SW ALBANY New York 256 42.66
4 US1NYAB0056 ALBANY 2.4 WNW ALBANY New York 276 42.68
5 USW00014735 ALBANY INTERNATIONAL AIRPORT ALBANY New York 280 42.75
6 US1NYNS0042 ALBERTSON 0.2 SSE NASSAU New York 142 40.77
Longitude Jan.1 Jan.2 Jan.3 Jan.4 Jan.5 Jan.6 Jan.7 Jan.8 Jan.9 Jan.10 Jan.11
1 -78.57 M M M M M M M M 0.0 0.0 0.0
2 -73.79 0.0 0.0 0.0 0.0 0.0 M M 0.0 0.0 T 0.0
3 -73.81 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 T 0.1
4 -73.84 M M 0.0 M M M M 0.0 0.0 T 0.0
5 -73.80 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 T T 0.0
6 -73.65 M 0.0 M M M M M 0.0 M 0.0 0.0
Jan.12 Jan.13 Jan.14 Jan.15 Jan.16 Jan.17 Jan.18 Jan.19 Jan.20 Jan.21 Jan.22
1 M M M 0.0 0.0 M M 0.0 M M M
2 T M 0.1 T 0.0 0.0 0.0 0.0 M M T
3 T 0.3 T T 0.0 0.0 T 0.0 0.5 0.1 T
4 M M 0.2 T 0.0 0.0 0.0 0.0 0.4 M M
5 0.2 0.2 T T 0.0 T 0.0 0.3 0.5 T 2.0
6 M M M M M 0.0 M 0.0 M 0.0 0.0
Jan.23 Jan.24 Jan.25 Jan.26 Jan.27 Jan.28 Jan.29 Jan.30 Jan.31
1 M M 0.0 M M 0.0 0.0 0.0 M
2 M M T 0.4 T T 0.0 0.0 0.5
3 3.8 3.0 T 1.0 T T 0.0 0.0 T
4 4.2 3.6 T 1.1 T 0.0 0.0 0.0 0.5
5 6.0 T 1.1 M T T 0.0 0.6 0.1
6 M 0.0 0.0 M 0.0 M 0.0 0.0 M
# Replace "M" (missing) and "T" (trace) with 0 in the Jan. columns only
# (restricting the replacement keeps IDs, station names, and counties intact)
snowfall <- snowfall %>%
mutate(across(starts_with("Jan."), ~ str_replace_all(., c("M" = "0", "T" = "0"))))
# Check data type
snowfall$Jan.1 %>% typeof()
[1] "character"
# Convert columns starting with Jan. to numeric
snowfall <- snowfall %>%
mutate(across(starts_with("Jan."), as.numeric))
# Add a new column with the sum of each row
snowfall <- snowfall %>%
mutate(total = rowSums(select(., starts_with("Jan."))))
####Produce an interactive graph of top 15####
# Attach libraries
library(ggplot2)
library(plotly)
# Select top 15 stations and reorder factor levels
top_stations <- snowfall %>%
top_n(15, total) %>%
mutate(Station.Name = fct_reorder(Station.Name, desc(total))) %>%
rename(Station = Station.Name, Snowfall = total)
# Create ggplot object with custom color scheme
p <- ggplot(top_stations, aes(x = Station, y = Snowfall, text = paste("Snowfall: ", Snowfall))) +
geom_bar(stat = "identity", aes(fill = Snowfall)) +
scale_fill_gradient(low = "grey", high = "blue") +
labs(title = "Top 15 Stations with the Most Snowfall in January",
x = "Station",
y = "Total January Snowfall (in)",
fill = "Total\nSnowfall") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
# Convert ggplot to interactive plot
ggplotly(p + theme(legend.position = "none"), tooltip = "text",
hovertemplate = paste("<b>%{x}</b><br>",
"Snowfall: %{y}<br>")) %>%
layout(hoverlabel = list(bgcolor = c(low = "grey", high = "blue")))
####Produce an interactive graph by date####
# Attach libraries
library(ggplot2)
library(plotly)
# Leave the columns with the ID and dates
new_snowfall <- snowfall %>% select(Station.Name, starts_with("Jan."))
# tidy
tidy_snowfall <- new_snowfall %>%
gather(key = "date", value = "Snowfall", -Station.Name)
# Change Jan. to YYYY-MM-DD format
tidy_snowfall$ymd <- as.POSIXct(paste0("2023-", sub("Jan\\.", "01-", tidy_snowfall$date)), format = "%Y-%m-%d")
tidy_snowfall$Date <- ymd(tidy_snowfall$ymd)
tidy_snowfall$Station <- tidy_snowfall$Station.Name
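As an aside, the two-step POSIXct/ymd conversion above can be collapsed into a single as.Date() call. A minimal sketch, where jan_to_date is a hypothetical helper (not part of the original script) that turns labels like "Jan.5" into proper dates:

```r
# Hypothetical helper: build an ISO date directly from labels such as "Jan.5"
jan_to_date <- function(x) {
  as.Date(sprintf("2023-01-%02d", as.integer(sub("Jan\\.", "", x))))
}

jan_to_date("Jan.5")   # "2023-01-05"
```

This avoids the intermediate POSIXct column and the extra lubridate dependency for this particular step.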
# Create ggplot object
ggp <- ggplot(tidy_snowfall, aes(x = Date, y = Snowfall, color = Station)) +
geom_line() +
labs(title = "Snowfall by Stations",
x = "Date",
y = "Snowfall (in)") +
theme_bw() +
theme(legend.position = "none") +
scale_x_date(date_labels = "%b/%d")
ggp
# Convert to plotly object for interactivity
ggplotly(ggp)
####Produce an interactive map of snowfall by NYS assembly districts####
# Attach libraries
library(sf)
library(tmap)
library(leaflet)
# Read in the shapefile for the New York State Assembly Districts
ad_shapefile <- st_read("https://services6.arcgis.com/EbVsqZ18sv1kVJ3k/arcgis/rest/services/NYS_Assembly_Districts/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
Reading layer `OGRGeoJSON' from data source
`https://services6.arcgis.com/EbVsqZ18sv1kVJ3k/arcgis/rest/services/NYS_Assembly_Districts/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson'
using driver `GeoJSON'
Simple feature collection with 150 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -79.76259 ymin: 40.47658 xmax: -71.77749 ymax: 45.01586
Geodetic CRS: WGS 84
# Create a spatial points dataframe
snowfall_sf <- st_as_sf(snowfall, coords = c("Longitude", "Latitude"), crs = 4326)
# Perform spatial join
snowfall_by_district <- st_join(snowfall_sf, ad_shapefile, join = st_intersects)
# Summarize snowfall by district
snowfall_summarized <- snowfall_by_district %>%
group_by(District) %>%
summarize(total_snowfall = sum(total))
# Count the number of observatories in each district
observatory_counts <- snowfall_by_district %>%
group_by(District) %>%
summarize(observatory_count = n())
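The two group_by() passes above could also be collapsed into a single summarize() call, which would avoid the extra st_join of the counts below. A sketch on a toy plain data frame (the sf version works the same way; df stands in for snowfall_by_district):

```r
library(dplyr)

# Toy stand-in for snowfall_by_district
df <- data.frame(District = c(1, 1, 2), total = c(3.0, 5.0, 7.0))

# One pass computes both the total snowfall and the station count per district
district_summary <- df %>%
  group_by(District) %>%
  summarize(total_snowfall = sum(total),
            observatory_count = n())
```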
# Add the observatory count to the snowfall_summarized data frame
snowfall_summarized <- st_join(snowfall_summarized, observatory_counts)
# Calculate the average snowfall for each district
snowfall_averages <- snowfall_summarized %>%
mutate(avg_snowfall = round(total_snowfall / observatory_count, 2)) %>% # average snowfall rounded to two decimal places
select(-District.y) %>%
rename(District = District.x)
# Join the district shapes with the snowfall averages
nyassembly <- st_join(ad_shapefile, snowfall_averages) %>%
select(-c(District.y, District.x)) %>%
mutate(OBJECTID = OBJECTID - 150)
# Define the color palette
pal <- leaflet::colorNumeric(palette = c("grey", "blue"), domain = nyassembly$avg_snowfall)
# Create the leaflet map
leaflet <- leaflet() %>%
addTiles() %>%
setView(lng = -75.3, lat = 43, zoom = 6.7) %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(data = nyassembly,
fillColor = ~ifelse(is.na(avg_snowfall), "#FFFFFF", pal(avg_snowfall)),
weight = 0.3,
color = ~ifelse(is.na(avg_snowfall), "#FFFFFF", pal(avg_snowfall)),
opacity = 1.0,
fillOpacity = 0.5,
label = ~paste("District: ", OBJECTID, ", ",
"Average Snowfall: ", format(round(avg_snowfall, 2), nsmall = 2)),
highlightOptions = leaflet::highlightOptions(
weight = 1.5,
color = "white",
fillOpacity = 0,
bringToFront = TRUE)) %>%
addLegend("bottomleft",
pal = pal,
values = nyassembly$avg_snowfall,
title = "Average Snowfall",
opacity = 0.8) %>%
addScaleBar(position = "topright")
Average January Snowfall in Inches by NYS Assembly Districts
leaflet
# Make sure Elevation and Latitude are numeric before the regression
# (as.numeric() is a no-op if they already are)
snowfall$Elevation <- as.numeric(snowfall$Elevation)
snowfall$Latitude <- as.numeric(snowfall$Latitude)
# Linear regression
model <- lm(total ~ Elevation + Latitude, data = snowfall)
summary(model)
Call:
lm(formula = total ~ Elevation + Latitude, data = snowfall)
Residuals:
Min 1Q Median 3Q Max
-13.9577 -2.7186 0.2336 2.4818 27.4458
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.582e+02 1.202e+01 -13.162 < 2e-16 ***
Elevation 2.530e-03 3.506e-04 7.216 2.18e-12 ***
Latitude 3.861e+00 2.839e-01 13.601 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.344 on 466 degrees of freedom
Multiple R-squared: 0.3891, Adjusted R-squared: 0.3864
F-statistic: 148.4 on 2 and 466 DF, p-value: < 2.2e-16
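One reason these R estimates differ from the Stata ones earlier on this page is that the R workflow never recoded the -9999 elevation sentinel, while the Stata do-file recoded it to 900. To mirror that step before refitting, something like the following would work (shown on a toy vector standing in for snowfall$Elevation):

```r
# Toy vector standing in for snowfall$Elevation
elev <- c(641, -9999, 280)

# Mirror the Stata step: recode elevation -9999=900
elev[elev == -9999] <- 900
```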